A Comparison of the Fairness of Adaptive and
Conventional Testing Strategies



In a previous report, Pine and Weiss (1976) examined how the item
characteristics of a conventional test affect its fairness when used in a
selection application. That study was concerned with the effects on fairness
of (1) the degree of bias in the test items, (2) the level of item
discrimination, and (3) the distribution of item difficulties. It was found
that when fairness was psychometrically operationalized in several ways, these
characteristics of a conventional test influenced its fairness. The
implication of these results is that the average item discrimination and the
distribution of item difficulties, as well as the distributions of test scores
for the subgroups being tested, should be considered when evaluating the
fairness of a conventional test used as a selection instrument.

Recently, a new class of potential selection tests, referred to as
tailored (Lord, 1970) or adaptive (Weiss & Betz, 1973) tests, has emerged.
These tests function quite differently from conventional tests and,
consequently, may have quite different fairness properties. In adaptive
testing each individual is sequentially administered a subset of items from a
larger pool of items; each succeeding item administered is contingent upon the
testee's responses to the preceding items (Weiss, 1974). Therefore, each
individual in a test population will typically receive a unique test, which
differs from the tests administered to other people with respect to its
average item discrimination and its distribution of item difficulties. Since
Pine and Weiss (1976) have already shown that differences among these
psychometric properties can affect the fairness of a test, it is appropriate
to examine whether, and to what extent, adaptive testing will decrease or
increase test fairness.

Based on the current state of knowledge of the psychometric properties of
adaptive tests, there are reasons to believe their use can increase test
fairness in several ways. Adaptive tests generally produce smaller standard
errors of measurement at the extremes of the ability continuum than do
conventional tests (e.g., Lord, 1970; Vale, 1975; Vale & Weiss, 1975). Given
the relationship between validity and the standard error of ability estimation
(Jensema, 1974), this implies that test validity for low-scoring individuals
could be expected to increase with adaptive testing. This would be an
important result, since members of minority groups tend to obtain low scores
on many selection instruments, and evidence of low validity for a minority
subgroup can be interpreted as indicating an unfair selection instrument.

Adaptive tests achieve their higher levels of measurement precision
(e.g., Lord, 1970; Vale & Weiss, 1975) and test validity (Jensema, 1974; Vale
& Weiss, 1975) by tailoring test difficulty to each testee. That is, in an
adaptive test the proportion of items answered correctly by each testee should
approach a theoretically optimal level (.50 if correct answers by random
guessing are not possible). Given an adaptive test item pool with an adequate
range of item difficulties, the proportion of items answered correctly would
be similar for members of both minority and majority subgroups. Consequently,
test information or precision would be equal. The result may be equal
validities for different subgroups and therefore a potential for reduction in
unfairness.

A third possible advantage of using adaptive testing would be in extending
the principle of differential prediction to single test items. When
conventional tests are used, differential prediction involves using separate
within-group regression equations to predict the criterion performance of
minority and majority subgroups. The logic behind this procedure is that the
best prediction of criterion performance for a given subgroup is obtained by
developing the prediction equation based only on data from the subgroup for
which predictions are to be made. The logical extension of this procedure
would be to predict criterion performance using test items which have been
calibrated separately for each subgroup. Adaptive testing can accomplish this
and, at the same time, adapt the difficulty of the test to the ability level
of the examinee.

The purpose of the present study was to compare the properties of one
adaptive testing procedure, Owen's (1969) Bayesian adaptive method, with the
conventional tests previously studied by Pine and Weiss (1976). Specifically,
the investigation was concerned with (1) how item pools with varying item
parameters and degrees of item bias would interact with the two test models to
affect test fairness, (2) how the use of differential prediction within the
context of each testing strategy affected test fairness, and (3) how the
placement of the prior ability distribution and choice of a termination
criterion affected fairness for the Bayesian strategy.

Method 

Assumptions

The above questions were investigated in the context of the same selection
situation assumed in the previous study (Pine & Weiss, 1976). The selection
process was modeled by a Monte Carlo simulation; it consisted of administering
a selection test to each hypothetical person and using the score from that test
to predict an external criterion represented by generated values of the known
latent trait, θ. The selection test was assumed to be completely described in
terms of its latent trait parameters so that each of its items could be
described in terms of its item discrimination (a), item difficulty (b), and
probability of being answered correctly by chance guessing (c). Some of the
items in the test, however, were assumed to be biased against the minority
subgroup; and the degree of item bias was expressed in terms of the latent
trait item parameters.

A testee's true ability level on the underlying latent trait, θ, was
represented by a number randomly generated from a standard normal distribution.
The same θ values were used for both the minority and majority subgroups. The
only distinction between the subgroups was how item responses were simulated.
For the testees in the "minority" subgroup, the degree of item bias (see Pine
& Weiss, 1976, p. 8) was added to the item difficulty to reflect the fact that
biased items are effectively more difficult for testees in minority subgroups.
This had the effect of lowering the probability of a correct response on
biased items for minority subgroup testees.

Each simulated testee was administered 18 conventional and 9 Bayesian
adaptive tests constructed from eighteen 100-item pools, described in Table 1.
Table 1 gives the specifications of each pool in terms of its latent trait
item parameters a and b (c was assumed to be .20 for all items). Each item
in each pool was additionally assumed to have a given level of subgroup bias,
which was defined as the difference between the item difficulty (b) parameters
for a majority (maj) and minority (min) subgroup. In the case of the
conventional tests, items were taken sequentially from all 18 pools; for the
Bayesian adaptive tests, items were selected in accordance with Owen's (1969)
Bayesian item search algorithm from only the 9 pools having a uniform
distribution of difficulties.
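
The response-generation rule described above (bias added to the item
difficulty for minority testees, guessing fixed at .20) can be illustrated
with a minimal sketch. The response function is not stated in this section,
so the logistic form with a 1.7 scaling constant is an assumption made here
for illustration, as are the function names and the 10-item example pool.

```python
import math
import random


def p_correct(theta, a, b, c=0.20, bias=0.0):
    """Probability of a correct response under an assumed 3-parameter
    logistic model; `bias` is added to the item difficulty b."""
    z = 1.7 * a * (theta - (b + bias))
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def simulate_test(theta, items, bias=0.0, rng=random):
    """Return the 0/1 responses of one testee to a list of (a, b, c) items."""
    return [int(rng.random() < p_correct(theta, a, b, c, bias))
            for (a, b, c) in items]


# Hypothetical 10-item pool: a = 1.1, evenly spaced difficulties, c = .20.
items = [(1.1, -2.0 + 4.0 * i / 9, 0.20) for i in range(10)]

rng = random.Random(0)
theta = rng.gauss(0.0, 1.0)  # true ability drawn from N(0, 1)

# The same theta is used for both subgroups; only the bias differs.
majority = simulate_test(theta, items, bias=0.0, rng=rng)
minority = simulate_test(theta, items, bias=1.0, rng=rng)
```

Under this sketch a biased item is always harder for a minority testee of the
same ability, since adding the bias to b lowers the response probability at
every θ.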



Table 1

Distributions of Item Difficulties, Levels of Item
Discrimination (a), and Degree of Item Bias in the
Simulated Item Pools

Item Pool                  Item Pool
No.    Difficulty          No.    Difficulty                    Bias
       Distribution               Distribution      a      (b maj - b min)

 1     Peaked              10     Uniform          .30          .5
 2     Peaked              11     Uniform          .30         1.0
 3     Peaked              12     Uniform          .30         2.0
 4     Peaked              13     Uniform          .70          .5
 5     Peaked              14     Uniform          .70         1.0
 6     Peaked              15     Uniform          .70         2.0
 7     Peaked              16     Uniform         1.10          .5
 8     Peaked              17     Uniform         1.10         1.0
 9     Peaked              18     Uniform         1.10         2.0


Adaptive test. The Bayesian adaptive testing strategy (McBride & Weiss,
1976; Owen, 1969; Urry, 1977) begins with an initial (prior) estimate of θ.
In this study a normally distributed prior distribution having a mean of 0 and
a standard deviation of 1.0 was used. The item to be administered to a testee,
described by its latent trait item parameters, is the item that minimizes the
expected error function (θ - θ̂)², where θ̂ is the current estimate of ability
and a function of the item parameters. Based on the current value of θ̂, the
item parameters, and whether the response to the administered item was correct
or incorrect, the current θ estimate is updated. This new estimate then becomes
the current estimate, and the cycle is repeated until a termination criterion
is reached. In this study Bayesian adaptive testing was terminated when a
fixed number of items was administered. Consequently, the standard error of
the Bayesian ability estimate varied for testees of different θ levels. The
average standard error of these ability estimates was an additional dependent
variable studied for the Bayesian testing strategy; it was compared to
the theoretical value based on the fixed number of items administered (Jensema,
1974; Urry, 1977).
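
The select-administer-update cycle described above can be sketched as
follows. This is not Owen's (1969) closed-form procedure: the sketch
substitutes a discrete grid approximation of the posterior and an assumed
logistic response function, and it picks the item with the smallest expected
posterior variance. All names, the grid, and the example pool are
hypothetical.

```python
import math
import random

GRID = [-4.0 + 0.1 * k for k in range(81)]  # theta values from -4 to 4


def p_correct(theta, a, b, c=0.20):
    """Assumed 3-parameter logistic response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def normal_prior(mean=0.0, sd=1.0):
    """Normal prior evaluated on the grid and normalized to sum to 1."""
    w = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in GRID]
    s = sum(w)
    return [x / s for x in w]


def update(post, item, correct):
    """Posterior over the grid after one scored response."""
    a, b, c = item
    w = []
    for p, t in zip(post, GRID):
        pc = p_correct(t, a, b, c)
        w.append(p * (pc if correct else 1.0 - pc))
    s = sum(w)
    return [x / s for x in w]


def mean_var(post):
    m = sum(p * t for p, t in zip(post, GRID))
    v = sum(p * (t - m) ** 2 for p, t in zip(post, GRID))
    return m, v


def expected_posterior_variance(post, item):
    """Posterior variance averaged over the two possible responses."""
    a, b, c = item
    p_right = sum(p * p_correct(t, a, b, c) for p, t in zip(post, GRID))
    v_right = mean_var(update(post, item, True))[1]
    v_wrong = mean_var(update(post, item, False))[1]
    return p_right * v_right + (1.0 - p_right) * v_wrong


def adaptive_test(theta, pool, n_items, prior_mean=0.0, rng=random):
    """Administer a fixed number of items; return (estimate, standard error)."""
    post = normal_prior(prior_mean, 1.0)
    remaining = list(pool)
    for _ in range(n_items):
        item = min(remaining,
                   key=lambda it: expected_posterior_variance(post, it))
        remaining.remove(item)
        correct = rng.random() < p_correct(theta, *item)
        post = update(post, item, correct)
    m, v = mean_var(post)
    return m, math.sqrt(v)


# Hypothetical 100-item pool with a = 1.1 and uniformly spread difficulties.
pool = [(1.1, -3.0 + 6.0 * i / 99, 0.20) for i in range(100)]
estimate, std_error = adaptive_test(0.5, pool, 10, rng=random.Random(1))
```

Because the test stops after a fixed number of items, the returned standard
error depends on the testee's θ and response pattern, which is why the
standard error is treated as a dependent variable in the text.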

Criterion Prediction



Each test was scored in two ways for predicting criterion performance on θ.
For the conventional tests, the regression equations for either the majority or
minority subgroup were used to convert the total number correct score to the θ
metric; for the Bayesian adaptive tests, criterion performance was predicted
using Bayesian scoring and either the majority or minority subgroup (i.e.,
biased) item parameters. When only the majority group regression equations or
item parameters were used to estimate θ, this was referred to as the majority
prediction condition. It was contrasted with the differential prediction
condition, which used the appropriate subgroup's regression equation or item
parameters to estimate θ. In addition, in order to study the effect of test
length, each test was scored after 10, 30, and 50 items had been administered.
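
For the conventional tests, the two scoring conditions differ only in which
regression line converts number-correct scores to the θ metric. The sketch
below contrasts them on a handful of made-up scores; the data and function
names are hypothetical and serve only to show the mechanics.

```python
def fit_line(x, y):
    """Least-squares (slope, intercept) for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(scores, line):
    slope, intercept = line
    return [intercept + slope * s for s in scores]


def mean(xs):
    return sum(xs) / len(xs)


# Made-up data: identical true abilities, lower minority scores due to bias.
theta = [-1.5, -0.5, 0.0, 0.5, 1.5]
maj_scores = [3, 5, 6, 7, 9]
min_scores = [2, 3, 4, 5, 7]

maj_line = fit_line(maj_scores, theta)
min_line = fit_line(min_scores, theta)

# Majority prediction: minority scores pushed through the majority equation.
majority_pred = predict(min_scores, maj_line)
# Differential prediction: minority scores use the minority equation.
differential_pred = predict(min_scores, min_line)

underprediction = mean(majority_pred) - mean(theta)      # about -0.9 here
differential_error = mean(differential_pred) - mean(theta)  # about 0 here
```

A within-group least-squares line always reproduces the group's mean
criterion value, so in this toy case the mean error vanishes under
differential prediction while the majority equation underpredicts the
minority mean.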

Fairness

Similar to the previous study (Pine & Weiss, 1976, pp. 9-12), selection
fairness was evaluated by R, the correlation between the predicted and true
ability on the latent ability; C, the difference between the mean ability levels
of the minority and majority subgroups; and T, the difference between the
proportion of individuals exceeding the selection cutoff (set at the mean of
the majority subgroup) in the two subgroups. In addition, a number of standard
distributional statistics were also studied. These included the mean, standard
deviation, skewness, and kurtosis of the ability estimates (θ̂) and, for the
adaptive testing strategy, the standard error of its ability estimates.
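
As a rough illustration of how these indices could be computed from simulated
data, the sketch below calculates a validity correlation, a C-index (mean
predicted minus mean true ability, following the definition given later in
the report), and the proportion above the cutoff for each subgroup, and then
takes minority-minus-majority differences. Taking the cutoff as the mean of
the majority subgroup's predicted abilities is an assumption, and the data
and names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def indices(theta, theta_hat, cutoff):
    """Return (validity, C, proportion above cutoff) for one subgroup."""
    r = pearson_r(theta, theta_hat)
    c = mean(theta_hat) - mean(theta)
    t = sum(e > cutoff for e in theta_hat) / len(theta_hat)
    return r, c, t


# Made-up true abilities and predicted abilities for the two subgroups.
theta = [-1.5, -0.5, 0.0, 0.5, 1.5]
maj_hat = [-1.4, -0.6, 0.1, 0.4, 1.5]
min_hat = [-2.0, -1.5, -1.0, -0.5, 0.5]

cutoff = mean(maj_hat)  # assumed: mean predicted ability of the majority
r_maj, c_maj, t_maj = indices(theta, maj_hat, cutoff)
r_min, c_min, t_min = indices(theta, min_hat, cutoff)

# Negative differences indicate a disadvantage for the minority subgroup.
r_diff = r_min - r_maj
c_diff = c_min - c_maj
t_diff = t_min - t_maj
```

With these toy numbers the minority mean is underpredicted by about .9 and 40
percentage points fewer minority testees exceed the cutoff.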

RESULTS 

Distributions of Predicted Scores 

Means, standard deviations, skewness, and kurtosis indices of ability
estimates as a function of the experimental conditions are given for 50-item
tests in Table 2; results for tests with 10 and 30 items, which generally
parallel those for 50 items, are given in Appendix Tables A and B. In these
tables the statistics for the true ability distribution (θ) are given in the
first row of the table, listed under the "True" group heading. Only standard
deviations are reported for conventional tests in the differential prediction
condition, since the other distributional statistics are not affected by
differential prediction. Table 2 also gives the results from the subcondition
of Bayesian adaptive testing in which the mean of the assumed prior distribution
used for the minority subgroup was varied. These conditions are based upon the
a=1.1, bias=1.0 condition; in these cases the mean of the prior distribution
was set at θ=-1.0, -.23, or +1.0.

Means 

As Table 2 shows, increasing item bias caused the mean of the minority
subgroup to be underpredicted for all of the majority prediction conditions,
regardless of the testing strategy employed. In the majority prediction
situation, this underprediction increased both with increasing item bias and
with increasing item discriminations. For low item discriminations (a=.30) and
for the first two levels of bias (0.5 and 1.0), adaptive testing led to a
larger underprediction than did the conventional tests. When differential
prediction was used, the adaptive test resulted in substantially less
underprediction than did either of the conventional tests. Furthermore, under
differential prediction, the degree of underprediction produced by the adaptive
test decreased with increasing item discrimination (a) levels.

The mean of the assumed prior ability distribution influenced the predicted
mean ability levels in the adaptive test. This effect was substantially less
when differential prediction was used. In the majority prediction situation,
a prior of θ=-1.0 increased the underprediction and a prior of +1.0 decreased
it. The smallest degree of underprediction across all conditions was obtained
when the prior was set at θ=+1.0; using differential prediction,
underprediction was reduced to nearly zero with this prior ability estimate.

Standard Deviations

The standard deviation of the ability distribution (1.01) was underpredicted
both for the majority and minority subgroups in all testing conditions. The
effect of item discrimination on the standard deviation is reflected by the
values for the majority subgroup, where item bias=0. For the conventional tests,
the standard deviations increased as the item discriminations were increased
from a=.30 to a=1.1. For the adaptive test, there was no increase for levels of
item discrimination beyond a=.70. Within each level of item discrimination
beyond a=.70, the standard deviations decreased as the item bias was increased,
for all testing conditions. The reduction in the standard deviations which
resulted from increased item bias was more pronounced at the highest level of
item discrimination. The uniform tests reflected this trend the least, and the
peaked tests reflected it the most.

The same general trends with respect to the influence of item discrimination
and bias on the standard deviations of the distributions of ability
estimates occurred when differential prediction was used. However, the influence
of item bias was much less in this condition, particularly under adaptive
testing, where the overall size of the standard deviations increased relative to
the values obtained in the majority prediction condition. For example, where
a=1.1 and item bias increased from .5 to 2.0, the adaptive test had standard
deviations of .86, .78, and .60 in the majority prediction condition; with
differential prediction, a corresponding increase in bias produced standard
deviations of .93, .90, and .90. The placement of the prior distribution in
the adaptive test influenced the size of the standard deviation in both the
majority and differential prediction conditions. In both cases the +1.0 mean
prior produced a smaller standard deviation than did the -1.0 mean prior.



For all testing conditions, the degrees of skewness and kurtosis tended to
increase in a positive direction as both item discrimination and item bias
increased (see Table 3). Positive values of skewness indicate that the mode of
the distribution is lower than its arithmetic mean, while positive values of
kurtosis indicate that the distribution is more peaked than a normal
distribution. In the majority prediction condition, the adaptive test produced
score distributions that were always more positively skewed and peaked than any
of the uniform conventional tests. Compared to the peaked tests, however, the
adaptive test was more positively skewed and peaked only at the lower levels of
item bias. In the combined high item discrimination and high bias conditions,
the peaked tests were considerably more leptokurtic than either the uniform
conventional or the adaptive tests.



Table 3 

Skewness and Kurtosis of Distribution of Ability Estimates for 50-Item Uniform (U)
and Peaked (P) Conventional Tests and the Bayesian Adaptive Test (BAT) as a
Function of Item Discrimination (a) and Degree of Item Bias, Using Majority and
Differential Prediction, for Majority (maj) and Minority (min) Subgroups

The same general trends with respect to the influence of item discrimination
and bias on skewness and kurtosis occurred when the differential prediction
version of the adaptive test was used. However, as for all the other
distributional statistics, differential prediction greatly reduced the
influence of item bias and discrimination on skewness and kurtosis. For
example, using differential prediction the skewness and kurtosis of the
distribution were always equal to or lower than those obtained in the majority
prediction condition. Comparing the influence of the priors, the +1.0 mean
prior resulted in a more positively skewed distribution, as well as a more
leptokurtic distribution, than did the -1.0 mean prior.



The validity coefficients (i.e., the correlations between true and estimated
ability levels) for the uniform and peaked conventional tests and the adaptive
tests, for all experimental conditions, are shown in Table 4. The three rows in
Table 4 labeled "maj" give the validities for the majority subgroup for each
value of item discrimination (a). These results correspond to the case in which
item bias is zero. The rows labeled "min" and "Diff" give, respectively, the
validities for the minority subgroup and the difference between corresponding
majority and minority values, for each combination of item discrimination and
item bias. In the half of the table labeled "Differential Prediction," only the
validity values from the adaptive tests are given. This is because differential
prediction in the conventional testing condition amounts to a linear
transformation of the test scores and therefore would not change the
correlation coefficients. The last six rows of the table give the results from
the subcondition of the adaptive test in which the mean of the assumed prior
used for the minority subgroup was varied.

The validities for adaptive tests increased with increasing test length and
item discrimination. For instance, the lowest validity, r=.501, occurred for a
10-item test with a=.30 and the highest, r=.968, was for a 50-item test with
a=1.1. A comparison of corresponding validities between the adaptive test and
either type of conventional test for 10-item tests showed that the validities
of the adaptive tests were higher in almost all cases in which item
discrimination was .70 or higher. For example, for a 10-item test with a=1.1
and item bias of 0.0, the validity correlations were .869 for the peaked test,
.820 for the uniform test, and .881 for the adaptive test. For 30- and 50-item
tests, the adaptive test had a lower validity for many of the lower
discriminating items; but at a=1.1 the adaptive test produced consistently
higher validities for all item pools.

Differential validity. A major concern with respect to test fairness is
not just how validity varies as a function of the test characteristics for
a given subgroup but, more importantly, how validity varies differentially
between subgroups. The reason for this is that if a difference in subgroup
validities does exist, the predictions made on the basis of the test scores are
not as accurate for one subgroup as for the other. Therefore, the effect of
item bias on validity was studied by comparing the validities for both subgroups
for all the item pools and test lengths. To facilitate this analysis,
differences between subgroup validities were determined. Differential validity
was thus defined as

$$r_{\text{diff}} = r_{\text{min}} - r_{\text{maj}}$$

A negative value of differential validity indicates that the majority subgroup
had a higher validity coefficient than the minority subgroup. These values
appear in Table 4 in the rows designated "Diff."

Table 4 shows that for the lowest a value (a=.30) as item bias increased,
validity differences increased for the uniform and adaptive tests but decreased
for the peaked test. Also, at a=.30 differential validity tended to be
positive (i.e., minority subgroup validities were higher for all test types),
with the largest values tending to occur for the adaptive test. However, for
item discriminations of a=.70 and 1.1 for test lengths of 30 and 50, the
direction of differential validity reversed, so that higher validity
correlations were observed for the majority subgroup. As the degree of item
bias and item discrimination increased, the size of the negative difference
became substantial for the peaked test relative to the adaptive test, while the
uniform tests generally produced a slightly smaller negative validity
difference than did the adaptive test. For example, the 50-item peaked test
with a=1.1 and bias of 2.0 had a -.123 difference between the subgroup
validities, compared to values of -.028 for the uniform test and for the
adaptive test.

The effect of choice of prior distribution on the validity of the adaptive
procedure was that when priors other than the majority subgroup prior were
used, validity tended to increase as the priors became higher in negative value.
Since the priors were varied only for the minority subgroup, the effect on
differential validity (i.e., the difference between majority and minority
subgroup validity coefficients) was that the negative priors produced
differential validities closer to zero than in the positive prior case.

Differential Prediction Condition

Subgroup validities. At item discrimination levels beyond a=.70, the
differential prediction version of the adaptive test produced higher validities
than did the majority version of the adaptive test or either conventional test.
This relative advantage increased as item bias increased. Typical are the
values for a 50-item adaptive test with a=1.1 and bias of 1.0, where r=.966
under differential prediction and r=.958 under majority prediction; this can
be compared with r=.954 and r=.931, respectively, for the uniform and peaked
conventional tests. Under these same conditions, the validity for the adaptive
test increased to r=.968 by using the -1.0 mean prior for the minority subgroup.

Differential validity. Validity differences between subgroups were
reduced by using the differential prediction version of the adaptive test.
Unlike the majority prediction case, in which the uniform conventional tests
often showed less differential validity than the adaptive test, in this
condition the adaptive test generally had the smaller differential validity.
Furthermore, the advantage of the adaptive test increased as item bias
increased. The negative mean priors tended to increase differential validity
for the 10-item test but had relatively little effect on the longer tests. The
+1.0 mean prior, however, led to an increase in differential validity at all
three test lengths.



C-Fairness



Majority Prediction



The Cleary-type fairness measure, C, was defined as the mean of the predicted
ability, θ̂, minus the mean of the true ability, θ (Pine & Weiss, 1976, p. 10).
Therefore, the C-index is in the same units as θ, which had a mean of 0 and a
standard deviation of 1.0.

In Figure 1 values of C_diff, the subgroup differences in the C-indices
(C_min - C_maj), are plotted against test length for the uniform, peaked
conventional, and adaptive tests for all levels of item discrimination in the
majority prediction condition. (Numerical values of C by subgroup are given in
Appendix Table C.) It can easily be seen by some simple algebraic manipulation
(substituting the difference between the mean predicted and mean true ability
of each subgroup for C_min and C_maj, and subtracting) that

$$C_{\text{diff}} = C_{\text{min}} - C_{\text{maj}} = (\bar{\hat{\theta}}_{\text{min}} - \bar{\theta}_{\text{min}}) - (\bar{\hat{\theta}}_{\text{maj}} - \bar{\theta}_{\text{maj}}) = \bar{\hat{\theta}}_{\text{min}} - \bar{\hat{\theta}}_{\text{maj}}$$

(because both subgroups had the same mean true ability level, θ). Therefore, a
negative value of C_diff implies unfairness to the minority subgroup in the
sense that their mean ability is underpredicted relative to the predicted mean
ability of the majority subgroup.



Figure 1

Group Differences in C-Index (C_diff) as a Function of Item
Discrimination (a), Item Bias, and Test Length, Using Majority Prediction

The C-index indicated increased unfairness to the minority subgroup
(higher negative values of C_diff) as item bias was increased from .5 to 2.0.
This trend increased as a negatively accelerating function of test length, with
the rate of increase varying as a function of item discrimination and degree
of item bias. For all tests, increasing item bias tended to be associated with
higher levels of C_diff for longer tests within a level of item discrimination.
This effect of test length decreased, however, as item discrimination increased.



There were small differences between tests on C_diff at the .30 level
of item discrimination. For 30- and 50-item tests, the adaptive test generally
had lower levels of C_diff. However, at the higher levels of item discrimination
and item bias, the adaptive test showed substantial advantage over either
conventional test in producing lower levels of C_diff. For example, for a
30-item test with a=1.1 and item bias of 2.0, the adaptive test produced ability
estimates one-third of a standard deviation less biased than those produced
by the uniform conventional test.

The choice of a prior distribution for the minority subgroups in the
adaptive test directly affected the resulting values of C_diff. The larger the
mean of the prior ability distribution (in a positive direction), the lower the
values of C_diff. Increasing test length had the effect of reducing differences
in C_diff due to the different Bayesian prior ability estimates.



Figure 2

Group Differences in C-Index (C_diff) as a Function of Item
Discrimination (a), Item Bias, and Test Length, Using Differential
Prediction with the Bayesian Adaptive Test

Values of C_diff are plotted in Figure 2 against test length for the
adaptive test for all levels of item discrimination in the differential
prediction condition. (Only the results of the adaptive tests are plotted,
since by definition C_diff is always equal to zero under differential prediction
for conventional tests; see Pine & Weiss, 1976, pp. 10-11.) As Figure 2
indicates, when differential prediction was employed with the adaptive test,
differences in the degree of unfairness between subgroups were practically
eliminated, especially at the high levels of item discrimination. For a=.30
there was a tendency for the minority subgroup to be overpredicted (i.e.,
positive values of C_diff). This tendency, however, decreased as item
discrimination increased (Figures 2b and 2c). At a=1.1 all C_diff values were
practically zero, the largest (which occurred at the shortest test length)
being -.022.

The relationship of C_diff to test length when various priors were used was
very similar to that found in the majority prediction case, except that values
of C_diff were closer together and shifted down to near the zero bias level.
Because of this shift, the use of the +1.0 mean prior caused an overprediction
of the minority subgroup. Once again, as test length was increased, values of
C_diff resulting from the different priors were more similar.



T-Fairness



T_diff can be defined as the difference between the T-indices for the
majority and minority subgroups and is equivalent to the difference between
the percentage of minority and majority testees predicted to be above the
majority group average (see Pine & Weiss, 1976, p. 11). A negative value of
T_diff indicates that the percent predicted to be above average was smaller for
the minority than for the majority subgroup, i.e., the test was more unfair to
the minority subgroup. Values of T_diff for the conventional and adaptive tests
are shown in Figure 3. The numerical values of T by subgroup for all tests are
shown in Appendix Table D.

As Figure 3 shows, T_diff varied in a complex way as a function of item
discrimination, degree of item bias, and test length. The adaptive test
showed smaller effects due to increases in test length. Comparing the tests
under different degrees of item bias and levels of item discrimination, the
10-item uniform tests were usually fairest (i.e., had smallest values of T_diff)
when a=.30 and .70, regardless of level of item bias. In all other cases for
a=.70 and 1.1, the adaptive test produced levels of T_diff closest to zero. At
the highest level of discrimination (a=1.1) and bias (2.0) for a 50-item test,
T_diff=37.6% for the adaptive test, 45.5% for the uniform conventional test,
and 44.2% for the peaked conventional test. In terms of the percentage of
examinees who would be judged above average, this implies a difference between
the adaptive and uniform conventional tests of 7.8% in the number of minority,
compared to majority, examinees.

The effect of choosing a negative mean prior in the Bayesian adaptive test
was to produce a negative bias against the minority subgroup. Holding test
length constant and comparing T_diff between levels of prior ability estimates,
the differences were about 2% to 3% for 30- and 50-item tests and 9% to 12%
for 10-item tests. However, these differences, due to choice of priors, were
relatively small compared to the effect of varying item bias in the cases in
which a prior equal to the true mean ability was used.

Figure 3

Group Differences in T-Index (T_diff) as a Function of
Item Discrimination (a), Item Bias, and Test Length, Using Majority Prediction

Differential Prediction

The results of using differential prediction in conjunction with the
adaptive test on T-fairness are shown in Figure 4. (For simplicity, the results
of differential prediction on conventional tests are not shown in Figure 4; but
their numerical values are given, along with those from the adaptive testing
condition, in Appendix Table D. A detailed discussion of the effects of
differential prediction on conventional tests can be found in Pine & Weiss,
1976.) Appendix Table D shows that the main effect of using differential
prediction with any of the three testing strategies was that a much larger
percentage of minority applicants was predicted to be above average than in the
majority prediction condition. Consequently, the general level of unfairness
was reduced using differential prediction.



Figure 4 

Group Differences in T-Index (T_diff) as a Function of 
Item Discrimination (a), Item Bias, and Test Length, Using 
Differential Prediction with the Bayesian Adaptive Test 
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Figure 4 shows that with the differential prediction version of the 
adaptive test, the minority subgroup tended to have a greater percentage of 
examinees above the mean than did the majority subgroup. For example, at the 
a=.30 level of discrimination, a 10-item test having an average item bias of 
2.0 had 8.2% more of the minority subgroup above average than the majority 
subgroup. However, as item discrimination increased, this overprediction was 
reduced to the point where at a=1.1, T_diff values ranged from +2.6% to 
[illegible]. 

This overprediction of the minority subgroup never occurred in the majority 
prediction case for any of the test strategies (see Figure 3). However, 
a similar result did occur when differential prediction was used in conjunction 
with the uniform conventional tests, but only at the shorter test lengths or at 
the lowest level of item discrimination. Overprediction of the minority 
subgroup almost never occurred with the peaked conventional test. 

When differential prediction was used, T_diff indicated that the adaptive 
test produced a fairer test than the conventional test in all but 6 cases (out 
of 18 possible) when discriminations were .30 and .70 (see Appendix Table D). 
At the high discrimination level (a=1.1), there was only one instance in which 
one of the conventional tests produced levels of T_diff nearer zero than the 
adaptive test (i.e., a 30-item uniform test with bias of 1.0); and the 
difference there was small: [illegible] for the uniform test compared to 1.8% 
for the adaptive test. 

Contrary to what was found when majority prediction was used, the choice 
of priors in the adaptive test had a relatively large effect on T_diff compared 
to item bias. As was the case with the C-index, differential prediction 
sometimes resulted in a positive bias in favor of the minority subgroup; this 
occurred primarily when the +1.0 prior was used. 



The Standard Error of Estimation in Bayesian Adaptive Testing 

The Bayesian adaptive testing procedure provides an estimate of the mean 
(θ) and variance (s²) of the estimated ability distribution after each test 
item administered. In the present study, s_m² (and its square root, s_m) varied 
across testees because a fixed test length termination criterion was employed. 
The average s_m (the standard error of estimate) was computed for each 
experimental condition and compared to its actual value; the resulting ratios 
are plotted for majority and differential prediction conditions in Figures 5 
and 6, respectively. 

Majority Prediction 

The average posterior standard deviation (s_m) after m test items were 
administered underestimated the theoretical value of the standard error of 
estimate (SEE) in all conditions (see Figure 5). This underestimation became 
progressively worse as item bias increased. For example, at a=1.1 for a 50- 
item test with bias of 2.0, the obtained ratio was .091. Little difference 
in the ratios resulted from use of the negative mean priors. The +1.00 prior, 
however, produced a larger ratio at each test length. 

Differential Prediction 

The ratios for the differential prediction case are plotted in Figure 6. 
In this condition, the effect of item bias on the size of the standard error 
ratios was greatly reduced. However, a small effect due to bias was still in 
evidence, particularly at the lower discrimination levels. The effect due to 
test length was also reduced, particularly at a=.30 (Figure 6a). There was 
still a systematic decrease in the ratios as a function of increasing 
discrimination. As in the majority prediction case, there was little difference 
due to use of negative mean prior ability estimates. The effect of using the 
+1.0 prior, however, was a relatively larger decrease in the standard error 
ratio than occurred in the majority prediction case. 

Figure 5 

Ratio of Estimated Standard Error of Estimate Based on 
Bayesian Posterior Variance to Predicted Standard Error 
of Estimate as a Function of Item Discrimination (a), 
Item Bias, and Test Length, Using Majority Prediction 

[Panels: (a) a=.30; (b) a=.70; (c) a=1.1; (d) Bayesian priors. Horizontal 
axis: number of items. Separate curves for each degree of item bias.] 

Figure 6 

Ratio of Estimated Standard Error of Estimate Based on 
Bayesian Posterior Variance to Predicted Standard Error 
of Estimate as a Function of Item Discrimination (a), 
Item Bias, and Test Length, Using Differential Prediction 
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DISCUSSION 

In a previous study (Pine & Weiss, 1976), it was shown that the fairness 
of a test when used as a selection instrument will depend on the item 
characteristics of the test items and on the way fairness is defined. The 
purpose of the present study was to extend these results by investigating the 
effects on fairness of varying the testing strategy as well as the 
characteristics of the test items. 

Shape of the Predicted Ability Distributions 

The shape of the test score distributions varied systematically as a 
function of the independent variables manipulated in this study. The effects 
of two of these, level of item discrimination and distribution of item 
difficulties, have been studied rather extensively in previous research 
(Cronbach & Warrington, 1952; Lord & Novick, 1968; Pine & Weiss, 1976; Urry, 
1969). However, the influence of type of testing strategy (i.e., conventional 
versus adaptive) and use of differential prediction on the shape of score 
distributions has not received previous attention. 

Skewness and kurtosis. In general, the findings were that as the degree 
of item bias and item discrimination increased, the shape of the score 
distributions became increasingly positively skewed and flat relative to a 
normal distribution. These findings are consistent with Lord & Novick's (1968, 
chap. 16) graphical demonstration that increasing test discrimination will tend 
to flatten the true score distribution, while increasing the difficulty of a 
test will cause positive skewness in the true score distribution. 

The shape of the ability score distributions was also a function of the 
testing strategy. The peaked conventional tests were more strongly influenced 
by the presence of item bias and by increasing item discrimination than were 
the other test types. At the highest degree of bias and discrimination, the 
score distributions for the peaked conventional tests became exceedingly flat 
and positively skewed. In contrast, the differential prediction version of 
the Bayesian adaptive test produced ability distributions relatively unchanged 
in shape across bias and discrimination conditions. 

Means and standard deviations. Both the means and standard deviations of 
the distributions tended to be underestimated. The underprediction of the 
standard deviations is the direct result of regression towards the mean. 
The effect of item bias on the mean of the score distributions was also in the 
predicted direction, since a direct inverse relationship would be expected 
between degree of bias in test items and test scores. In terms of the 
means and standard deviations, the uniform conventional tests showed the 
least underprediction of the standard deviations, and the peaked conventional 
tests generally resulted in the smallest underprediction of the means. However, 
the condition in which values of the prior ability estimates of the adaptive 
testing strategy were varied produced less underprediction of both the 
mean and standard deviation of the ability distribution. Furthermore, when 
the differential prediction version of the adaptive test was employed, even 
lower levels of underprediction resulted. 

Validity Index 

The results of the validity data have several implications for the 
construction of tests and the interpretation of existing test data. First, as 
was previously discussed (Pine & Weiss, 1976), the validity results offer a 
possible explanation for the often-reported but controversial phenomenon of 
differential validity. According to the model used in this study, the 
existence of subgroup validity differences (i.e., differential validity) is 
interpreted as an indication of the fairness of a selection instrument. The 
smaller the difference between validity coefficients, the fairer the selection 
instrument. 

Several researchers (e.g., Campbell, Crooks, Mahoney, & Rock, 1973; 
Schmidt, Berner, & Hunter, 1973) have presented arguments, based on various 
analyses of empirical data, that differential validity does not exist as a 
substantive phenomenon. The results of this study indicate that differential 
validity is real and, in fact, can be expected when test items are biased 
against one of the subgroups being tested. Furthermore, based on the present 
study, it can be seen that the properties of the testing strategy will also 
influence differential validity. Both conventional and adaptive testing 
strategies had a direct influence on the extent to which a given degree of item 
bias affected differential validity. 

Variations within each of these testing strategies also affected 
differential validity. Within the conventional tests, the distribution of item 
difficulties (peaked or uniform) and item discrimination level had a 
differential effect. For the adaptive tests, the influence of item bias on 
differential validity varied as a function of the level of item discrimination, 
the choice of prior ability estimate, and whether majority item parameters or 
subgroup parameters (i.e., the majority or differential prediction condition) 
were used. 

Conventional versus adaptive tests. Both the majority prediction version 
of the adaptive test and the uniform conventional tests tended to produce a 
smaller differential validity for a given degree of item bias than did the 
peaked conventional test at the longer test lengths and higher levels of item 
discrimination. The adaptive test and uniform conventional test produced very 
similar levels of differential validity when majority prediction and a zero 
prior ability estimate were used. However, the adaptive test produced higher 
minority subgroup validities and therefore a smaller differential validity when 
the prior ability estimate used for the minority subgroup was one standard 
deviation below the mean of the majority subgroup. 

The reason that using a negative prior led to higher minority subgroup 
validity was that in this condition the test items were biased by one standard 
deviation on the difficulty scale. This resulted in the minority testees 
responding as though they were one standard deviation below their true ability 
level. Therefore, the -1.00 prior ability estimate more closely matched their 
effective mean ability level. 
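
The equivalence described above can be checked numerically. The following is 
a minimal sketch, assuming a two-parameter normal-ogive response model in which 
the probability of a correct response depends on a(θ - b); the function names 
and the numerical values are illustrative and are not taken from the study. 
Under that assumption, raising an item's difficulty by the amount of the bias 
gives exactly the same response probability as lowering the testee's ability 
by the same amount. 

```python
import math


def p_correct(theta, a, b):
    """Normal-ogive probability of a correct response: Phi(a * (theta - b))."""
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_correct_biased(theta, a, b, bias):
    """Same item when bias raises its difficulty for the minority subgroup."""
    return p_correct(theta, a, b + bias)


# Illustrative (hypothetical) values: one item, one minority testee.
a, b, bias = 1.1, 0.0, 1.0
theta = 0.5

biased = p_correct_biased(theta, a, b, bias)   # biased item, true ability
shifted = p_correct(theta - bias, a, b)        # unbiased item, ability - bias

print(biased, shifted)
```

Because the two probabilities coincide, a prior centered one standard 
deviation lower is centered on the ability level at which the minority testees 
appear to respond. 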

Item discrimination. The influence of item discrimination on differential 
validity also varied as a function of the testing strategy. In the majority 
prediction condition, the level of item discrimination which led to the 
smallest degree of differential validity appeared to be lower than might have 
been suspected for both conventional and adaptive tests. For the peaked 
conventional test, the lowest discrimination value led to the least 
differential validity. However, with both the majority prediction version of 
the adaptive test and the uniform conventional tests, the intermediate level of 
item discrimination led to the least differential validity. 

That differential validity was not directly related to level of item 
discrimination may seem surprising, since both in this study and in Urry's 
(1969) study, it was shown that validity increased with increased item 
discrimination for the difficulty levels examined. The essential factor, 
however, is that when validity is considered with regard to fairness (i.e., 
differential validity), the effect of item bias must be considered. The 
presence of item bias effectively increases the difficulty parameter (b) for 
the minority subgroup while leaving the b level unchanged for the majority 
testees. Since validity is a function of both b and item discrimination (a), 
the reported effect in differential validity resulted. 

It appears that when the same prediction parameters are used in a 
conventional test for all subgroups, the usual practice of selecting items 
having the highest discriminations will generally have the effect of increasing 
subgroup validity differences if test items are biased. The more biased the 
items are, the larger will be the difference in subgroup validities. A 
reduction in validity differences can be achieved by using intermediate levels 
of item discrimination; of course, then the reduction in differential validity 
will have been achieved at the cost of lowering the overall level of validity. 
In some situations, particularly if the majority subgroup validity is 
relatively high, such a tradeoff may be desirable. 

Differential prediction in the adaptive test. The differential prediction 
version of the adaptive test, however, provides a means of controlling the 
level of differential validity while maintaining a high level of validity. 
When this strategy was used, it produced both the smallest differential 
validity and the highest overall validity for both subgroups. Apparently, in 
terms of validity, fairness is most readily attainable by adopting the testing 
strategy which has the greatest ability to adapt to the individual testee. 
Among the testing strategies investigated in this study, this was the 
differential prediction version of the Bayesian adaptive test, in which test 
items were selected for a given testee on the basis of the item parameters 
derived for the testee's subgroup. 

In the context of this study, the C-index, based on Cleary's fairness 
model, gave the degree of statistical bias in the estimation of a known value 
of ability (θ). The T-index, based on Thorndike's definition of fairness, 
reflected the meaning of these mis-estimations in terms of the percentage of 
applicants who were predicted to exceed some qualifying point of ability (in 
this case, the mean of the majority population). 
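
The two indices can be illustrated with a short sketch. The definitions below 
are simplified readings of the descriptions above, not the report's exact 
formulas: the C-index is taken as the mean signed error of the predicted 
abilities, and the T-index difference as the percentage of minority testees 
predicted above the cutoff minus the corresponding majority percentage. All 
function names and data values are hypothetical. 

```python
def c_index(true_abilities, predicted_abilities):
    """Mean signed estimation error (predicted minus true ability)."""
    errors = [p - t for t, p in zip(true_abilities, predicted_abilities)]
    return sum(errors) / len(errors)


def percent_above(predicted_abilities, cutoff):
    """Percentage of testees whose predicted ability exceeds the cutoff."""
    above = sum(1 for p in predicted_abilities if p > cutoff)
    return 100.0 * above / len(predicted_abilities)


def t_diff(minority_predicted, majority_predicted, cutoff=0.0):
    """Minority percentage above the cutoff minus majority percentage."""
    return (percent_above(minority_predicted, cutoff)
            - percent_above(majority_predicted, cutoff))


# Hypothetical data: both subgroups have the same true abilities, but the
# minority predictions are uniformly half a standard deviation too low.
true_abilities = [-1.0, -0.5, 0.2, 0.8]
majority_pred = [-1.0, -0.5, 0.2, 0.8]
minority_pred = [-1.5, -1.0, -0.3, 0.3]

minority_c = c_index(true_abilities, minority_pred)
group_t_diff = t_diff(minority_pred, majority_pred, cutoff=0.0)
print(minority_c, group_t_diff)
```

In this toy case a constant underestimate of half a standard deviation on the 
C-index translates into 25 percentage points fewer minority testees predicted 
above the majority mean. 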

The Cleary view of fairness tends to optimize selection from the vantage 
point of the selecting institution, since it assures that the ablest candidates 
will be selected. The Thorndike model tends to be more liberal from the 
viewpoint of the minority subgroup. Even in situations in which the Cleary index 
indicates a perfectly fair test, it has been previously shown by Schmidt and 
Hunter (1974) that the Thorndike index may still indicate unfairness. This 
result was replicated both in the previous study (Pine & Weiss, 1976) and in 
the present study. 

From the previous study it was shown that even within conventional tests, 
the spread of item difficulties can have a strong effect on fairness at some 
levels of item discrimination and for some test lengths. For the levels of 
discrimination and test lengths most commonly found in practice, the general 
finding was that the peaked test was fairer in terms of the C-index and the 
uniform test was fairer in terms of the T-index, when majority prediction was 
employed. The differential prediction condition indicated the conservative 
nature of the T-index. By definition, in this condition all tests were perfectly 
fair by the Cleary model; yet the T-index indicated the presence of substantial 
unfairness, particularly for very short tests or for tests composed of highly 
discriminating items. Furthermore, with differential prediction of ability, 
the use of tests with uniform distributions of item difficulties was 
consistently more favorable to the minority subgroup. 

Conventional versus adaptive tests. In the present study Bayesian adaptive 
tests were compared to the same conventional tests used in the previous 
study (Pine & Weiss, 1976) in order to determine the effects of testing 
strategy on test fairness. For the levels of item discrimination and test 
lengths most commonly found in practice, the general finding was that the 
adaptive tests were fairer than either the peaked or uniform conventional tests 
in terms of both the C- and T-indices when majority prediction was employed. 
The advantage of the adaptive tests over conventional tests was increased 
further by adjusting the prior ability estimate used for the minority subgroup. 


In the differential prediction condition using the C-index, all 
conventional tests are perfectly fair by definition; therefore, they cannot be 
improved by adaptive testing. Yet, at high levels of item discrimination, the 
adaptive test approached this ideal level of performance. On the T-index, 
there was an even greater advantage in favor of the minority subgroup 
using adaptive tests, as compared to conventional tests, when differential 
prediction was used rather than majority prediction. 

Item discrimination. All testing strategies in the majority prediction 
condition showed an overall improvement in fairness on the T-index as item 
discrimination decreased. The C-index also displayed this relationship and 
proved to be less affected by increases in item bias at low levels of item 
discrimination. These findings are disturbing, since it does not seem logical 
that "poor" items should have to be used in order to achieve fairness, 
particularly since their use reduces overall validity. 

However, this result is an artifact of using the majority prediction 
parameters for both subgroups with either conventional or adaptive testing 
strategies. This can be seen in Figure 7. It is obvious from Figure 7 that 
since the minority subgroup mean is predicted through the majority prediction 
line, the mean predicted ability level of the minority subgroup will be highest 
when the slope of the regression line (and therefore r) is lowest. Since the 
levels of item discrimination examined in this report were directly proportional 
to r (the correlation of observed and estimated ability levels), it follows that 
the predicted mean ability levels of the minority subgroup will increase as 
item discrimination decreases. Both the C- and T-indices indicated increased 
fairness as the mean of the minority subgroup increased. Therefore, it follows 
that fairness as measured by both the C- and T-indices should improve as item 
discrimination is decreased. 
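
The regression argument can be made concrete with a small worked example. 
This is a sketch under simplifying assumptions that are not stated in the 
report: scores are standardized on the majority subgroup (mean 0, standard 
deviation 1), so the majority prediction line passes through the origin with 
slope r, and the minority subgroup's observed mean is set to an illustrative 
value of one standard deviation below the majority mean. 

```python
def predicted_mean(observed_mean_z, r):
    """Predicted mean ability from the majority regression line (slope r)."""
    return r * observed_mean_z


# Hypothetical observed minority mean, depressed by item bias.
observed_minority_mean = -1.0

predictions = {}
for r in (0.9, 0.6, 0.3):
    predictions[r] = predicted_mean(observed_minority_mean, r)
    print(f"r = {r:.1f}: predicted minority mean = {predictions[r]:+.2f}")
```

As r falls, the predicted minority mean regresses toward the majority mean of 
zero, which is why both indices register an apparent gain in fairness at low 
item discrimination even though nothing about the items has improved. 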

Differential prediction. Another problem which occurs with 
the use of majority prediction is that the test items which are optimal with 
respect to reducing differential validity are not the same items that are 
optimal with respect to the C- and T-indices of fairness. This dilemma can be 
resolved with the adaptive testing model by using differential prediction. 
Overall fairness was optimized in this case in a logical, consistent manner for 
all three of the fairness indices. With differential prediction, fairness as 
measured by the R-, C-, and T-indices was dramatically increased for all levels 
of item discrimination and item bias. Furthermore, adaptive testing displayed 
decreased sensitivity to increasing item bias with respect to each of the 
fairness indices as item discrimination was increased. 

The effect of differential prediction within the context of the conventional 
test model could be observed only on the T-index, since the R-index was 
unchanged and C=0 by definition. Both the uniform and peaked tests showed a 
marked improvement in fairness under differential prediction, although not as 
much as was found with the adaptive test. However, even when differential 
prediction was employed, the peaked test still was most robust to increasing 
item bias and fairest overall at the lowest level of item discrimination. 
Consequently, it would appear that when using conventional tests, a high degree 
of fairness can be obtained (both with respect to the T-index and differential 
validity) by peaking item difficulties and using items with a relatively low 
level of discrimination. This policy, however, would decrease the overall 
level of test validity. 



Figure 7 

Linear Prediction of Minority Mean 
Ability Using Majority Prediction Line 




[Axis markers: minority (min) and majority (maj) means] 



The uniform conventional tests produced results which were much more similar 
to those found with the adaptive tests. Differential prediction improved 
fairness at higher levels of discrimination. This similarity between the 
uniform and adaptive tests may have resulted from both types of tests having a 
uniform spread of item difficulties. 




Prior ability distributions. The prior ability distribution chosen to 
begin the Bayesian ability estimation process also appeared to affect the 
fairness of each test. Prior ability estimates which underestimated the true 
ability caused an increased underprediction of the mean, compounding the degree 
of underprediction caused by item bias. This underprediction was directly 
reflected in the C- and T-fairness indices. 

The choice of a prior ability estimate did not influence differential 
validity in the same way as it did the C and T measures. For the validity index 
(assuming the presence of item bias), prior ability estimates below the true 
population ability levels actually improved fairness (i.e., reduced differential 
validity). This was a result of the fact that the presence of item bias made 
it appear as though the minority subgroup's mean ability level was lower than 
it really was; consequently, the low prior ability estimate was effectively more 
appropriate, producing a higher validity for that subgroup. 

The conflicting results of the effect of prior ability estimates on test 
fairness were not alleviated by using differential prediction. In the 
differential prediction condition, prior ability estimates higher than true 
ability levels led to estimation bias, this time in favor of the minority 
subgroup. The higher priors still, however, led to a less favorable result with 
respect to differential validity. 

There is a need for more research on the best procedures to follow in 
choosing subgroup prior ability estimates in Bayesian adaptive testing. The 
problem is that if different priors are used for each subgroup, based on 
available ability data, certain minority subgroups may be unfairly affected. 
This might occur for subgroups that have tended to score lower on past tests, 
which may have been biased. Lower mean prior ability estimates would be used 
for members of these subgroups, which would then lead to lower levels of 
estimated ability with the Bayesian adaptive testing method. On the other hand, 
if identical prior ability estimates are used for all subgroups and there is a 
true ability level difference between the subgroups, those subgroups that are 
overestimated will be given an unfair advantage. Furthermore, this advantage 
will be much larger than the disadvantage in fairness that results from using 
a prior ability estimate which underpredicts true ability. However, data from 
the current study suggest that both of these undesirable outcomes can be 
minimized by increasing test length. 
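
Why longer tests dilute the influence of the prior can be seen in a 
deliberately simplified stand-in for the Bayesian procedure. The sketch below 
is not the adaptive algorithm used in the study; it assumes a normal prior and 
n independent, normally distributed observations of ability with known 
variance, for which the posterior mean has a closed form. All names and values 
are hypothetical. 

```python
def posterior_mean(prior_mean, prior_var, obs_mean, obs_var, n):
    """Posterior mean for a normal prior and n normal observations."""
    precision = 1.0 / prior_var + n / obs_var
    return (prior_mean / prior_var + n * obs_mean / obs_var) / precision


# Hypothetical case: true (and observed) mean ability is 0.0, but the
# prior for the subgroup is set one standard deviation too low.
prior_mean, prior_var = -1.0, 1.0
obs_mean, obs_var = 0.0, 1.0

pull = {}
for n in (10, 30, 50):
    pull[n] = posterior_mean(prior_mean, prior_var, obs_mean, obs_var, n)
    print(f"{n:2d} items: posterior mean = {pull[n]:+.3f}")
```

In this simplified setting the bias inherited from a misplaced prior shrinks 
in proportion to 1/(n + 1), consistent with the observation that increasing 
test length reduces the effect of the choice of prior. 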

The Bayesian Error of Estimate 

In the present study the adaptive testing termination criterion was always 
based on test length. However, the observed posterior standard deviation (s_m), 
which can be thought of as an estimate of the standard error of the final 
ability estimate, has also been suggested as a test termination criterion 
in the Bayesian adaptive test (Jensema, 1974; Urry, 1977). Therefore, it was 
of interest to determine how test fairness is influenced under this alternative 
method of test termination. 

One apparent advantage of using s_m is that it would seem to provide a means 
of reducing differential validity, since all subgroups would simply be tested 
to the same estimated error level. Since validity bears an inverse relationship 
to the standard error of estimate (Urry, 1977), all subgroups should attain 
nearly equivalent validities. The crucial factor, then, is how well s_m reflects 
the actual standard error of estimate. For example, if it understates the 
actual SEE, testing will be prematurely terminated, resulting in lower test 
validity. The results of the present study indicate that this is exactly what 
happens. 
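
The cost of premature termination can be sketched numerically. The example 
assumes the standard linear-regression relation for a standardized criterion 
(variance 1), SEE = sqrt(1 - r²), and treats the ratio of s_m to the actual 
SEE as a fixed constant; both are simplifying assumptions, and the target and 
ratio values are hypothetical rather than read from Figures 5 or 6. 

```python
import math


def validity_from_see(see):
    """Validity r implied by a standard error of estimate (unit variance)."""
    return math.sqrt(max(0.0, 1.0 - see * see))


def validity_at_termination(target_s, ratio):
    """Validity actually reached when testing stops at s_m == target_s.

    ratio is s_m divided by the actual SEE; ratio < 1 means s_m understates
    the true error, so the actual SEE at termination is target_s / ratio.
    """
    actual_see = target_s / ratio
    return validity_from_see(actual_see)


target = 0.3                                         # hypothetical stopping value
accurate = validity_at_termination(target, 1.0)      # s_m equals the SEE
understated = validity_at_termination(target, 0.5)   # s_m is half the SEE
print(accurate, understated)
```

With these illustrative numbers, stopping at the same nominal error level 
yields a validity of about .95 when s_m is accurate but only .80 when s_m is 
half the actual SEE, so a subgroup with a lower ratio ends testing with lower 
validity. 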

Even when the test items were unbiased, the ratio of the average s_m to 
the theoretical SEE decreased with increasing test length and levels of item 
discrimination. When item bias was introduced, the ratio decreased with 
increasing item bias. Consequently, the more biased the test items are, the more 
likely it is that differential validity will occur when Bayesian adaptive tests 
are terminated using the observed s_m. The data also suggest that differential 
validity can be expected to increase the longer the testing process is allowed 
to continue. 

In using the Bayesian adaptive testing strategy to reduce unfairness, it 
might be possible to compensate for the reduction in validity by differentially 
setting s_m for each subgroup. Appropriate levels for s_m could be estimated 
from the data shown in Figures 5 or 6. One problem in devising such a 
compensatory method is that it could lead to a substantial difference in the 
average number of test items taken by each subgroup; it has been shown both in 
the present study and in previous studies that test length affects the other 
fairness indices. 

Advantages of differential prediction. When differential prediction was 
used, the average s_m became a much better estimator of the theoretical value 
and was not nearly as adversely influenced by item bias. At a=1.1 it was 
quite robust with respect to increasing item bias; and even though the 
underprediction of error increased as item discrimination increased, the use of 
items with high discriminations is likely to lead to comparable degrees of 
underprediction for all subgroups. Moreover, with differential prediction, 
all three of the fairness indices gave convergent implications for the fairness 
of the adaptive test. Therefore, if Bayesian adaptive testing is terminated 
on the basis of observed values of s_m, it should be employed within the 
differential prediction model studied here, in which items are sequentially 
selected for administration on the basis of subgroup item parameter values. 



All current interpretations of the fairness of a selection procedure depend 
on the distribution of predicted criterion scores. Both in the present study 
and in the previous study by Pine and Weiss (1976), it has been shown that the 
distribution of predicted scores will vary as a function of item characteristics 
and testing strategy, i.e., adaptive versus conventional. Therefore, even 
assuming that a selection test is totally free of items biased against a 
particular subgroup, measured "fairness" will vary as a function of the item 
characteristics and strategy for selecting items and scoring a test. Thus, the 
fairness of a test, and consequently of a selection program using that test, can 
vary even if it contains no biased items. 

If, in addition, a selection test contains some degree of bias in its 
items, the situation is compounded. The results of these studies have shown 
that the extent to which biased items influence selection fairness depends 
on the testing strategy. Some strategies are more sensitive to the presence 
of item bias than are others. 

In comparing the Bayesian adaptive testing strategy to conventional 
tests, it was found that the adaptive test was consistently fairer than the 
conventional tests for tests of 30 or more items with discrimination levels of 
a=.70 and higher. Furthermore, the differential prediction version of the 
adaptive test produced almost perfectly fair performance on all fairness 
indices at high levels of item discrimination. Within the Bayesian strategy 
it was found that the choice of subgroup prior ability estimates affected 
test fairness. These effects were minimized by using differential prediction 
and by increasing test length. Finally, the use of observed values of the 
Bayesian posterior error of estimate to terminate the Bayesian adaptive test 
does not assure the reduction of differential validity and can lead to increased 
unfairness. 

One point of caution which needs to be made concerning the use of the 
Bayesian adaptive testing model is that great care must be taken in choosing an 
appropriate prior ability distribution for each subgroup. There are 
essentially two policies which can be followed: (1) using different priors for 
each subgroup or (2) using identical priors. However, both of these options can 
result in unfairness: against the minority subgroup in the former case and 
against the majority subgroup in the latter. Obviously, a dilemma exists. 
Until this dilemma can be resolved through further research, the use of 
equivalent prior ability estimates for the majority and minority subgroups is 
probably advisable, since this will result in the minimum adverse impact on 
minority subgroups. 

Future research efforts on the reduction of test unfairness should be 
concerned with the problem of prior selection in population subgroups. In 
addition, other versions of the adaptive approach to testing should be examined 
to determine their effects on test fairness. There is also a need to 
determine the kinds and extent of intergroup differences in test item 
difficulties and discriminations that occur in live-testing situations. These 
differences should then be incorporated into future simulation studies to 
determine their interactions with testing strategies and their effects on 
test fairness. Finally, it is most important that the findings based on 
theoretical and simulation studies be verified in a live-testing situation. 
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APPENDIX: SUPPLEMENTARY TABLES 

Table A

Score Distribution Characteristics for Conventional Tests of Length 10, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                       10            30            50            70           100
a    Bias  Group      U      P      U      P      U      P      U      P      U      P

.30  0.0   maj     38.4   56.8   45.4   47.4   41.6   43.8   44.0   46.2   44.0   49.8
     .5    min     30.4   46.2   31.8   33.8   28.0   28.2   29.6   30.2   30.6   30.8
           diff    -8.0  -10.6  -13.6  -13.6  -13.6  -15.6  -14.4  -16.0  -13.4  -19.0
     1.0   min     21.6   37.4   20.4   21.4   18.0   17.2   18.2   19.0   17.2   19.2
           diff   -16.8  -19.4  -25.0  -26.0  -23.6  -26.6  -25.8  -27.2  -26.8  -30.6
     2.0   min     10.0   20.2    7.8    7.2    4.4    3.6    4.4    4.6    4.6    4.6
           diff   -28.4  -36.6  -37.6  -40.2  -37.2  -40.2  -39.6  -41.6  -39.4  -45.2

.70  0.0   maj     40.6   53.0   50.2   45.0   46.2   49.4   48.0   49.8   48.2   49.2
     .5    min     26.6   35.4   34.4   29.2   28.4   30.4   32.0   29.8   31.0   29.2
           diff   -14.0  -17.6  -15.8  -15.8  -17.8  -19.0  -16.0  -20.0  -17.2  -20.0
     1.0   min     16.4   22.8   19.6   15.0   14.8   15.6   16.2   15.4   15.4   14.8
           diff   -24.2  -30.2  -30.6  -30.0  -31.4  -33.8  -31.8  -34.4  -32.8  -34.4
     2.0   min      4.6    6.2    3.8    3.6    2.8    3.6    2.4    3.4    3.0    3.2
           diff   -36.0  -46.8  -46.4  -41.4  -43.4  -45.8  -45.6  -46.4  -45.2  -46.0

1.1  0.0   maj     40.0   52.6   47.2   46.8   47.8   47.6   47.8   48.4   48.8   50.0
     .5    min     25.2   35.2   29.6   29.4   28.0   29.2   27.8   29.2   29.0   29.2
           diff   -14.8  -17.4  -17.6  -17.4  -19.8  -18.4  -20.0  -19.2  -19.8  -20.8
     1.0   min     15.0   20.4   17.4   15.2   15.8   14.6   13.8   15.4   15.4   15.4
           diff   -25.0  -32.2  -29.8  -31.6  -32.0  -33.0  -34.6  -33.0  -33.4  -34.6
     2.0   min      3.4    4.8    3.0    3.4    2.4    3.4    2.2    3.0    2.6    2.6
           diff   -36.6  -47.8  -44.2  -43.4  -45.4  -44.2  -45.6  -45.4  -46.2  -47.4
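
The diff rows in the table above appear to be the minority-subgroup entry minus the corresponding majority (bias 0.0) entry in the same column. A minimal Python sketch of that arithmetic follows, using the a = .30, bias 1.0 entries for test lengths 10 and 30. The dictionary keys and the function name subgroup_diff are illustrative assumptions, not part of the report.

```python
# Entries for a = .30 at test lengths 10 and 30, uniform (U) and peaked (P) tests.
# Keys such as "10U" are hypothetical labels chosen for this sketch.
majority = {"10U": 38.4, "10P": 56.8, "30U": 45.4, "30P": 47.4}   # bias 0.0, maj
minority = {"10U": 21.6, "10P": 37.4, "30U": 20.4, "30P": 21.4}   # bias 1.0, min


def subgroup_diff(minority, majority):
    """Return minority minus majority for each test, rounded to one decimal."""
    return {key: round(minority[key] - majority[key], 1) for key in majority}


diff = subgroup_diff(minority, majority)
print(diff)
```

Under this reading, the computed values reproduce the bias 1.0 diff entries for those four columns (-16.8, -19.4, -25.0, -26.0), and negative entries indicate a lower value for the minority subgroup.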



Table B

Score Distribution Characteristics for Conventional Tests of Length 30, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                                              Test Length
                       10            30            50            70           100
a    Bias  Group      U      P      U      P      U      P      U      P      U      P

.30  0.0   maj     38.4   56.8   45.4   47.4   41.6   43.8   44.0   46.2   44.0   49.8
     .5    min     52.6   46.2   40.6   42.6   48.4   47.8   45.0   42.0   46.8   45.4
           diff    14.2  -10.6   -4.8   -4.8    6.8    4.0    1.0   -4.2    2.8   -4.4
     1.0   min     41.8   37.4   49.2   47.8   47.2   44.4   44.4   46.2     --     --
           diff     3.4  -19.4    3.8     .4    5.6     .6     .4     --     --     --
     2.0   min     43.6   33.0   41.4   39.0   44.6   45.2   41.4   44.2   44.0   42.4
           diff     5.2  -23.8   -4.0   -8.4    3.0    1.4   -2.6   -2.0    0.0   -7.4

.70  0.0   maj     40.6   53.0   50.2   45.0   46.2   49.4   48.0   49.8   48.2   49.2
     .5    min     47.8   47.8   48.2   41.6   47.0   45.8   47.4   46.4   45.8   46.4
           diff     7.2   -5.2   -2.0   -3.4     .8   -3.6    -.6   -3.4   -2.4   -2.8
     1.0   min     49.8   46.0   45.0   41.4   47.0   44.8   45.6   41.8   47.6   42.8
           diff     9.2   -7.0   -5.2   -3.6     .8   -4.6   -2.4   -8.0    -.6   -6.4
     2.0   min     41.2   31.2   44.0   36.6   40.8   38.2   39.4   38.8   42.4   37.4
           diff      .6  -21.8   -6.2   -8.4   -5.4  -11.2   -8.6  -11.0   -5.8  -11.8

1.1  0.0   maj     40.0   52.6   47.2   46.8   47.8   47.6   47.8   48.4   48.8   50.0
     .5    min     46.0   43.6   43.8   42.0   45.6   44.0   48.2   43.2   47.4   45.0
           diff     6.0   -9.0   -3.4   -4.8   -2.2   -3.6     .4   -5.2   -1.4   -5.0
     1.0   min     48.4   36.2   48.8   39.6   42.6   39.6   46.6   38.4   45.6   39.4
           diff     8.4  -16.4    1.6   -7.2   -5.2   -8.0   -1.2  -10.0   -3.2  -10.6
     2.0   min     51.0   27.6   40.8   28.6   43.0   30.8   43.0   30.8   44.0   29.6
           diff    11.0  -25.0   -6.4  -18.2   -4.8  -16.8   -4.8  -17.6   -4.8  -20.4
Table C

Values of the C-Index for Uniform (U) and Peaked (P) Conventional Tests and
for the Bayesian Adaptive Test (BAT), using Majority and Minority Prediction,
and for Majority (maj) and Minority (min) Subgroups, and Subgroup Differences
(diff), as a Function of Item Discrimination (a), Degree of Item Bias, for Tests
of 10, 30, and 50 Items

                                  Majority Prediction                           Differential Prediction
                      10 Items             30 Items             50 Items         10 Items  30 Items  50 Items
a    Bias  Group      U      P    BAT      U      P    BAT      U      P    BAT       BAT       BAT       BAT

.30        maj     .000   .000  -.024   .000   .000  -.149   .000   .000  -.177     -.024     -.149     -.177
     .5    min    -.124  -.124  -.171  -.258  -.255  -.394  -.317  -.328  -.492     -.045     -.129     -.171
           diff   -.124  -.124  -.147  -.258  -.255  -.245  -.317  -.328  -.315     -.021      .020      .006
     1.0   min    -.254  -.264  -.303  -.527  -.534  -.621  -.635  -.664  -.790     -.035     -.111     -.161
           diff   -.254  -.264  -.279  -.527  -.534  -.472  -.635  -.664  -.613     -.011      .038      .016
     2.0   min    -.509  -.530  -.586 -1.033 -1.023 -1.118 -1.262 -1.286 -1.353     -.002     -.097     -.156
           diff   -.509  -.530  -.562 -1.033 -1.023  -.969 -1.262 -1.286 -1.176      .022      .052      .021

.70        maj     .000   .000  -.044   .000   .000  -.141   .000   .000  -.154     -.044     -.141     -.154
     .5    min    -.283  -.319  -.359  -.377  -.434  -.518  -.422  -.461  -.561     -.074     -.115     -.141
           diff   -.283  -.319  -.315  -.377  -.434  -.377  -.422  -.461  -.407     -.030      .026      .013
     1.0   min    -.586  -.623  -.651  -.801  -.837  -.885  -.879  -.894  -.963     -.073     -.105     -.141
           diff   -.586  -.623  -.607  -.801  -.837  -.744  -.879  -.894  -.809     -.027      .036      .013
     2.0   min   -1.141 -1.142 -1.150 -1.533 -1.524 -1.568 -1.675 -1.634 -1.682     -.070     -.121     -.156
           diff  -1.141 -1.142 -1.106 -1.533 -1.524 -1.427 -1.675 -1.634 -1.528     -.026      .020     -.002

1.1        maj     .000   .000  -.073   .000   .000  -.103   .000   .000  -.119     -.073     -.103     -.119
     .5    min    -.361  -.411  -.434  -.433  -.432  -.509  -.462  -.462  -.541     -.095     -.100     -.121
           diff   -.361  -.411  -.361  -.433  -.452  -.406  -.462  -.462  -.422     -.022      .003     -.002
     1.0   min    -.739  -.780  -.745  -.901  -.861  -.896  -.946  -.882  -.946     -.079     -.097     -.123
           diff   -.739  -.780  -.672  -.901  -.861  -.793  -.946  -.882  -.827     -.006      .006     -.004
     2.0   min   -1.480 -1.298 -1.227 -1.760 -1.470 -1.538 -1.778 -1.499 -1.599     -.068     -.100     -.121
           diff  -1.480 -1.298 -1.154 -1.760 -1.470 -1.435 -1.778 -1.499 -1.480      .005      .003     -.002

Priori
           maj                  -.073                -.103                -.119     -.073     -.103     -.119
     -1.0  min                 -1.079               -1.097               -1.123     -.303     -.190     -.204
           maj                 -1.006                -.994               -1.004     -.230     -.087     -.085
     -.25  min                  -.257                -.572                -.661      .255      .104      .054
           maj                  -.184                -.469                -.542      .328      .207      .173
     +1.0  min                  -.883                -.989               -1.028     -.145     -.122     -.149
           maj                  -.610                -.886                -.909     -.072     -.019     -.030
Table D

Values of the T-Index for Uniform (U) and Peaked (P) Conventional Tests and
for the Bayesian Adaptive Test (BAT), using Majority and Minority Prediction,
and for Majority (maj) and Minority (min) Subgroups, and Subgroup Differences
(diff), as a Function of Item Discrimination (a), Degree of Item Bias, for
Tests of 10, 30, and 50 Items

Majority Prediction

                      10 Items             30 Items             50 Items
a    Bias  Group      U      P    BAT      U      P    BAT      U      P    BAT

.30  0.0   maj     38.4   56.8   45.4   45.4   47.4   36.6   41.6   43.8   36.0
     .5    min     30.4   46.2   34.8   31.8   33.8   23.6   28.0   28.2   21.8
           diff    -8.0  -10.6  -10.6  -13.6  -13.6  -13.0  -13.6  -15.6  -14.2
     1.0   min     21.6   37.4   26.0   20.4   21.4   15.8   18.0   17.2   12.6
           diff   -16.8  -19.4  -19.4  -25.0  -26.0  -20.8  -23.6  -26.6  -23.4
     2.0   min     10.0   20.2   13.0    7.8    7.2    4.6    4.4    3.6    3.6
           diff   -28.4  -36.6  -32.4  -37.6  -40.2  -32.0  -37.2  -40.2  -32.4

.70  0.0   maj     40.6   53.0   42.0   50.2   45.0   37.4   46.2   49.4   36.8
     .5    min     26.6   35.4   25.8   34.4   29.2   22.2   28.4   30.4   21.8
           diff   -14.0  -17.6  -16.2  -15.8  -15.8  -15.2  -17.8  -19.0  -15.0
     1.0   min     16.4   22.8   15.2   19.6   15.0   12.8   14.8   15.6   11.2
           diff   -24.2  -30.2  -26.8  -30.6  -30.0  -24.6  -31.4  -33.8  -25.6
     2.0   min      4.6    6.2    3.2    3.8    3.6    2.0    2.8    3.6    1.6
           diff   -36.0  -46.8  -38.8  -46.4  -41.4  -35.4  -43.4  -45.8  -35.2

1.1  0.0   maj     40.0   52.6   36.4   47.2   46.8   38.6   47.8   47.6   38.8
     .5    min     25.2   35.2   21.6   29.6   29.4   21.0   28.0   29.2   22.4
           diff   -14.8  -17.4  -14.8  -17.6  -17.4  -17.6  -19.8  -18.4  -16.4
     1.0   min     15.0   20.4   11.8   17.4   15.2   12.0   15.8   14.6   12.0
           diff   -25.0  -32.2  -24.6  -29.8  -31.6  -26.6  -32.0  -33.0  -26.8
     2.0   min      3.4    4.8    1.8    3.0    3.4    1.2    2.4    3.4    1.2
           diff   -36.6  -47.8  -34.6  -44.2  -43.4  -37.4  -45.4  -44.2  -37.6

Differential Prediction

                      10 Items             30 Items             50 Items
a    Bias  Group      U      P    BAT      U      P    BAT      U      P    BAT

.30  0.0   maj     38.4   56.8   45.4   45.4   47.4   36.6   41.6   43.8   36.0
     .5    min     52.6   46.2   44.6   40.6   42.6   37.2   48.4   47.8   36.0
           diff    14.2  -10.6    -.8   -4.8   -4.8     .6    6.8    4.0    0.0
     1.0   min     41.8   37.4   45.2   49.2   47.8   38.8   47.2   44.4   35.6
           diff     3.4  -19.4    -.2    3.8     .4    2.2    5.6     .6    -.4
     2.0   min     43.6   33.0   53.6   41.4   39.0   41.8   44.6   45.2   38.0
           diff     5.2  -23.8    8.2   -4.0   -8.4    5.2    3.0    1.4    2.0

.70  0.0   maj     40.6   53.0   42.0   50.2   45.0   37.4   46.2   49.4   36.8
     .5    min     47.8   47.8   38.2   48.2   41.6   38.8   47.0   45.8   39.2
           diff     7.2   -5.2   -3.8   -2.0   -3.4    1.4     .8   -3.6    2.4
     1.0   min     49.8   46.0     --   45.0   41.4   43.4   47.0   44.8   41.0
           diff     9.2   -7.0     --   -5.2   -3.6    6.0     .8   -4.6    4.2
     2.0   min     41.2   31.2   40.8   44.0   36.6   38.8   40.8   38.2   36.6
           diff      .6  -21.8   -1.2   -6.2   -8.4    1.4   -5.4  -11.2    -.2

1.1  0.0   maj     40.0   52.6   36.4   47.2   46.8   38.6   47.8   47.6   38.8
     .5    min     46.0   43.6   37.2   43.8   42.0   40.6   45.6   44.0   40.0
           diff     6.0   -9.0     .8   -3.4   -4.8    2.0   -2.2   -3.6    1.2
     1.0   min     48.4   36.2   37.8   48.8   39.6   40.4   42.6   39.6   40.0
           diff     8.4  -16.4    1.4    1.6   -7.2    1.8   -5.2   -8.0    -.4
     2.0   min     51.0   27.6   39.0   40.8   28.6   39.6   43.0   30.8   39.2
           diff    11.0  -25.0    2.6   -6.4  -18.2    1.0   -4.8  -16.8     .4